21 research outputs found

    Comparison of Style Features for the Authorship Verification of Literary Texts

    The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment is about 50,000 characters long, and there are 40 fragments for each author. The corpora cover 20 authors who wrote in English, Russian, and French, and 8 Spanish-language authors. The authors of this paper use existing algorithms to calculate low-level features, popular in computational linguistics, and rhythm features, common in literary texts. Low-level features include n-grams of words, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentage of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: whether the text belongs to a particular author or not. AdaBoost and a neural network with an LSTM layer are considered as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in the verification of particular authors, and the superiority, on average, of combinations of feature types over single feature types. The best values of precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.
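    As a rough illustration of the low-level features named in the abstract above (letter and punctuation frequencies, average word and sentence lengths), the following is a minimal sketch and not the authors' implementation. The function name `low_level_features`, the feature keys, and the naive sentence splitting on `.`, `!`, and `?` are all assumptions made for the example.

```python
import re
import string
from collections import Counter


def low_level_features(text: str) -> dict[str, float]:
    """Compute a few character- and word-level style features for one fragment.

    Hypothetical helper: the feature names and the naive splitting rules are
    illustrative assumptions, not taken from the paper.
    """
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Runs of letters only (Unicode-aware, excludes digits and underscores).
    words = re.findall(r"[^\W\d_]+", text)
    letters = [ch.lower() for ch in text if ch.isalpha()]
    punct = [ch for ch in text if ch in string.punctuation]

    features = {
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "punct_per_char": len(punct) / len(text) if text else 0.0,
    }
    # Relative frequency of each letter among all letters in the fragment.
    for letter, count in Counter(letters).items():
        features[f"freq_{letter}"] = count / len(letters)
    return features


if __name__ == "__main__":
    print(low_level_features("Hi there. Go now!"))
```

    A vector of such values per fragment could then be passed to any binary classifier; the abstract names AdaBoost and an LSTM network for that step.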

    Sentiment classification of long newspaper articles based on automatically generated thesaurus with various semantic relationships

    The paper describes a new approach to sentiment classification of long newspaper texts using an automatically generated thesaurus. An important part of the proposed approach is the creation of a specialized thesaurus and the computation of terms' sentiment polarities based on relationships between terms. The approach's efficiency was demonstrated on a corpus of articles about American immigrants. The experiments showed that the automatically created thesaurus provides better classification quality than manual ones, and that for this task the approach generally outperforms existing ones.

    Text Classification by Genre Based on Rhythm Features

    The article analyzes the rhythm of texts of different genres: fiction novels, advertisements, scientific articles, reviews, tweets, and political articles. The authors identified lexico-grammatical figures in the texts (anaphora, epiphora, diacope, aposiopesis, etc.) that are markers of text rhythm. On their basis, statistical features were calculated that describe these rhythm features quantitatively and structurally. The resulting text model was visualized for statistical analysis using boxplots and heat maps, which showed differences in the rhythm of texts of different genres. The boxplots showed that almost all genres differ from each other in the overall density of rhythm features. The heat maps showed different rhythm patterns across genres. Further, the rhythm features were successfully used to classify texts into six genres. The classification was carried out in two ways: a binary classification for each genre, separating that genre from the remaining genres, and a multi-class classification of the text corpus into six genres at once. Two text corpora, in English and Russian, were used for the experiments. Each corpus contains 100 fiction novels, scientific articles, advertisements and tweets, 50 reviews and political articles, i.e. a total of 500 texts. The high quality of the classification with neural networks showed that rhythm features are a good marker for most genres, especially fiction. The experiments were carried out using the ProseRhythmDetector software tool for the Russian and English languages. The text corpora contain 300 texts for each language.
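    To make the idea of a rhythm-figure density concrete, here is a small sketch that counts one figure, anaphora, per 100 sentences. It is not the ProseRhythmDetector algorithm: the working definition of anaphora used here (two adjacent sentences opening with the same word), the naive sentence splitter, and the function names are simplifying assumptions for illustration only.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def anaphora_per_100_sentences(text: str) -> float:
    """Count adjacent sentence pairs that open with the same word, per 100 sentences.

    Simplified, hypothetical definition of anaphora; real detectors also handle
    multi-word repetitions, clause boundaries, and stop words.
    """
    sentences = split_sentences(text)
    first_words = []
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        first_words.append(words[0] if words else None)
    # Compare each sentence opener with the opener of the next sentence.
    hits = sum(
        1
        for a, b in zip(first_words, first_words[1:])
        if a is not None and a == b
    )
    return 100.0 * hits / len(sentences) if sentences else 0.0


if __name__ == "__main__":
    print(anaphora_per_100_sentences("We came. We saw. We won. Then it rained."))
```

    Computing one such density per figure yields a numeric vector per text, which is the kind of input a genre classifier could consume.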

    A survey on thesauri application in automatic natural language processing

    This paper investigates the efficiency of thesaurus use in popular natural language processing (NLP) fields: information retrieval and the analysis of texts and subject areas. A thesaurus is a natural language resource that models a subject area and can reflect human experts' knowledge in many NLP tasks. The main aim of this survey is to determine how much thesauri affect processing quality and where they can provide better performance. We describe studies that use different types of thesauri, discuss the contribution of the thesaurus to the achieved results, and propose directions for future research in the thesaurus field.

    Sentiment Classification of Russian Texts Using Automatically Generated Thesaurus

    This paper presents an approach to sentiment classification of Russian texts that applies an automatically generated thesaurus of the subject area. The approach consists of a standard machine learning classifier and a procedure embedded into it that uses thesaurus relationships for better sentiment analysis. The thesaurus is generated fully automatically and does not require an expert's involvement in the classification process. Experiments conducted with the approach on four Russian-language text corpora show the effectiveness of applying a thesaurus to sentiment classification.

    ΠšΠ»Π°ΡΡΠΈΡ„ΠΈΠΊΠ°Ρ†ΠΈΡ русскоязычных тСкстов ΠΏΠΎ ΠΆΠ°Π½Ρ€Π°ΠΌ Π½Π° основС соврСмСнных эмбСддингов ΠΈ Ρ€ΠΈΡ‚ΠΌΠ°

    The article investigates modern vector text models for the genre classification of Russian-language texts. The models include ELMo embeddings, the pre-trained BERT language model, and a set of numerical rhythm features based on lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and analysis of statistics for the rhythm features made it possible to identify both the most rhythmically diverse genres (novels and reviews) and the least diverse (scientific articles). These genres were subsequently the ones classified best with the help of rhythm features and an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings made it possible to separate one genre from another with a small number of errors. The multi-class classification F-score reached 99%. The study confirms the efficiency of modern embeddings in computational linguistics tasks, and also makes it possible to highlight the advantages and limitations of the set of rhythm features for genre classification.

    Sentiment Classification into Three Classes Applying Multinomial Bayes Algorithm, N-grams, and Thesaurus

    The paper develops a method that classifies texts in English and Russian by sentiment into positive, negative, and neutral. The proposed method is based on the Multinomial Naive Bayes classifier with the additional use of n-grams. The classifier is trained either on three classes, or on two contrasting classes with a threshold to separate neutral texts. Experiments with texts on various topics showed a significant improvement in classification quality for reviews from a particular domain. In addition, the use of thesaurus relationships for sentiment classification into three classes was analyzed; however, it did not show a significant improvement in the classification results.
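    The second training variant mentioned in the abstract above (two contrasting classes plus a threshold that separates neutral texts) can be sketched as follows. This is an illustrative toy, not the paper's method: the class name `TwoClassNB`, whitespace tokenization, unigram-plus-bigram features, Laplace smoothing, and the threshold value of 1.0 on the log-odds are all assumptions.

```python
import math
from collections import Counter


def ngrams(text: str, n_max: int = 2) -> list[str]:
    """Return word n-grams (n = 1..n_max) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class TwoClassNB:
    """Toy Multinomial Naive Bayes over n-grams with Laplace smoothing.

    Trained on "pos" and "neg" only; a text counts as neutral when the
    log-odds fall inside [-threshold, threshold] (hypothetical rule).
    Assumes both classes appear in the training data.
    """

    def fit(self, texts: list[str], labels: list[str]) -> "TwoClassNB":
        self.counts = {"pos": Counter(), "neg": Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = set(self.counts["pos"]) | set(self.counts["neg"])
        return self

    def log_odds(self, text: str) -> float:
        """Return log P(pos | text) - log P(neg | text)."""
        score = math.log(self.docs["pos"] / self.docs["neg"])
        v = len(self.vocab)
        totals = {c: sum(self.counts[c].values()) for c in self.counts}
        for gram in ngrams(text):
            if gram not in self.vocab:
                continue  # skip n-grams never seen in training
            p_pos = (self.counts["pos"][gram] + 1) / (totals["pos"] + v)
            p_neg = (self.counts["neg"][gram] + 1) / (totals["neg"] + v)
            score += math.log(p_pos / p_neg)
        return score

    def predict(self, text: str, threshold: float = 1.0) -> str:
        """Label as neutral when the evidence for either class is weak."""
        score = self.log_odds(text)
        if score > threshold:
            return "positive"
        if score < -threshold:
            return "negative"
        return "neutral"


if __name__ == "__main__":
    model = TwoClassNB().fit(
        ["great movie loved it", "terrible movie hated it"],
        ["pos", "neg"],
    )
    print(model.predict("great"), model.predict("movie"))
```

    In practice the threshold would be tuned on held-out data; the three-class variant would instead train a third set of counts for neutral texts.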

    Analysis of the Use of Various Types of Relations between Terms of a Thesaurus Generated with Hybrid Methods in Text Classification Tasks

    The main purpose of the article is to analyze how effectively different types of thesaurus relations can be used in text classification tasks. The study is based on an automatically generated thesaurus of a subject area that contains three types of relations: synonymous, hierarchical, and associative. To generate the thesaurus, the authors use a hybrid method based on several linguistic and statistical algorithms for extracting semantic relations. The method makes it possible to create a thesaurus with a sufficiently large number of terms and relations among them. The authors consider two problems: topical text classification and sentiment classification of large newspaper articles. To solve them, the authors developed two approaches that complement standard algorithms with a procedure that takes thesaurus relations into account to determine semantic features of texts. The approach to topical classification includes the standard unsupervised BM25 algorithm and a procedure that takes into account synonymous and hierarchical relations of the subject-area thesaurus. The approach to sentiment classification consists of two steps. In the first step, a thesaurus is created whose terms' sentiment weights are calculated from the term occurrences in the training set or from the weights of related thesaurus terms. In the second step, the thesaurus is used to compute the features of words from the texts and to classify the texts with the SVM or Naive Bayes algorithm. In experiments with the text corpora BBCSport, Reuters, PubMed, and a corpus of articles about American immigrants, the authors varied the types of thesaurus relations involved in the classification and the degree of their use. The results of the experiments make it possible to evaluate how effectively thesaurus relations can be applied to the classification of raw texts and to determine under what conditions particular relations matter more or less. In particular, the most useful thesaurus relations are synonymous and hierarchical, as they provide better classification quality.

    ΠšΠ»Π°ΡΡΠΈΡ„ΠΈΠΊΠ°Ρ†ΠΈΡ тСкстов ΠΏΠΎ уровням CEFR с использованиСм ΠΌΠ΅Ρ‚ΠΎΠ΄ΠΎΠ² машинного обучСния ΠΈ языковой ΠΌΠΎΠ΄Π΅Π»ΠΈ BERT

    This paper presents a study of the automatic classification of short coherent texts (essays) in English according to the levels of the international CEFR scale. Determining the level of a natural-language text is an important component of assessing students' knowledge, including the checking of open tasks in e-learning systems. To solve this problem, vector text models were considered that are based on stylometric numerical features at the character, word, and sentence-structure levels. The obtained vectors were classified by standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and LogisticRegression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach was compared with the application of the BERT language model (six different variants). The best model, bert-base-cased, achieved an F-score of 69%. The analysis of classification errors showed that most of them occur between neighboring levels, which is quite understandable from the point of view of the domain. In addition, the quality of classification strongly depended on the text corpus, as shown by a significant difference in F-scores when the same text models were applied to different corpora. In general, the obtained results showed the effectiveness of automatic text level detection and the possibility of its practical application.

    Russian-Language Thesauri: Automated Construction and Application in Natural Language Processing Tasks

    The paper reviews the existing Russian-language thesauri in digital form and the methods of their automatic construction and application. The authors analyzed the main characteristics of open-access thesauri for scientific research, and evaluated trends in their development and their effectiveness in solving natural language processing tasks. The statistical and linguistic methods of thesaurus construction that make it possible to automate development and reduce the labor costs of expert linguists were studied. In particular, the authors considered algorithms for extracting keywords and semantic thesaurus relationships of all types, as well as the quality of the thesauri generated with these tools. To illustrate the features of various methods for constructing thesaurus relationships, the authors developed a combined method that generates a specialized thesaurus fully automatically from a text corpus in a particular domain and several existing linguistic resources. With the proposed method, experiments were conducted with two Russian-language text corpora from two subject areas: articles about migrants and tweets. The resulting thesauri were assessed using an integrated assessment developed in the authors' previous study, which makes it possible to analyze various aspects of a thesaurus and the quality of the generation methods. The analysis revealed the main advantages and disadvantages of various approaches to the construction of thesauri and the extraction of semantic relationships of different types, and made it possible to determine directions for future study.